A COMPUTATIONAL METHOD FOR THE IDENTIFICATION OF 
CANDIDATE PROTEINS USEFUL AS ANTI-INFECTIVES 



Field of Invention 

The present invention provides a novel method for the identification of candidate proteins
in pathogens that are useful as anti-infectives. More particularly, the present invention relates
to candidate genes for these proteins. The invention further provides new leads for the
development of candidate genes and their encoded proteins, in terms of their functional relevance
to predictive, preventive or curative approaches. The computational method involves the
calculation of several sequence attributes, whose subsequent analysis leads to the identification
of outlier proteins in different pathogens. Thus, the present invention is useful for the
identification of outlier proteins in pathogenic organisms. These outlier proteins are either
virulence proteins or antigens, or may be used as drug targets. The outlier proteins from
different genomes constitute a set of candidates for functional characterization through targeted
gene disruption, microarrays and proteomics. Further, these proteins constitute a set of
candidates for further testing in the development of anti-infectives such as vaccine candidates,
diagnostics or drug targets. The genes encoding the candidate proteins are also provided.

Background of the Invention and Prior Art Discussion

Progress in genome sequencing projects has generated a large number of inferred
protein sequences from different organisms, and this number is likely to increase in the coming
years. The availability of complete genome sequences offers an opportunity for increased
understanding of the biology of these organisms, because it not only provides biological insights
into any given organism but also provides substantially more information on the physiology and
evolution of microbial species through comparative analysis (Fraser et al. 2000). The set of
microbes whose genomes have been sequenced so far is a diverse one, ranging from organisms
living under extreme environmental conditions to model organisms of biology and some of the
most important human pathogens (http://www.ncbi.nlm.nih.gov).

It is expected that the availability of information on the complete set of proteins from
infectious human pathogens will enable the development of novel drugs to combat them. This is
important in cases such as the emerging epidemic of multiple drug-resistant Mycobacterial
isolates (Barry et al. 2000), although, so far, no new drugs derived from genomics-based
discovery have been reported to be in a development pipeline (Black and Hare 2000). A
paradigm for exploiting the genome to inform the development of novel antituberculars has been
proposed, utilizing the techniques of differential gene expression as monitored by DNA
microarrays coupled with the emerging discipline of combinatorial chemistry (Barry et al. 2000).

The whole genome sequences of microbial pathogens also present new opportunities for
clinical applications such as diagnostics and vaccines (Weinstock et al. 2000). However, the
predicted number of proteins encoded in different genomes is fairly large, and about half of the
proteins in any given genome are of unknown biological function (Fraser et al. 2000). Some of
them are also unique to each organism. In this scenario, the development of data mining tools and
their application to decipher useful patterns in the protein sequence dataset can be useful for
designing suitable experiments such as differential gene expression, heterologous expression at
large scale (Weinstock et al. 2000) and proteomics studies (Chakravarti 2000). Recently, it has
been demonstrated that the utilization of genome sequences by application of bioinformatics
through genomics and proteomics can expedite the vaccine discovery process by rapidly providing
a set of potential candidates for further testing (Chakravarti 2000 (a) and (b)). Presently, data
mining is carried out using traditional computer programs that perform motif searches or
identify distinct domains differing in physico-chemical properties such as hydrophobicity and
sequence conservation. The drawback of these methods is that the functions of one third to one
half of the proteins remain unknown even after their application. Therefore, with the presently
available computational tools, it is likely that potential new candidates for vaccines,
diagnostics or drug targets are missed. A need therefore exists for a computational tool that
uses different sequence attributes of protein sequences instead of sequence patterns. Through
such a shift in framework, the applicants have overcome this limitation. The novelty of the
present invention lies in the development of a method based on different attributes of protein
sequences, which is useful for the prediction of a functional role in virulence,
immuno-pathogenicity and drug response.

Objects of the Invention

The main object of the present invention is to provide a computational method for the
identification of proteins useful as anti-infectives. These anti-infectives are vaccine candidates,
diagnostics or drug targets.

Another object of the invention is to provide proteins with unusual sequence characteristics
identified as outliers in different pathogens.

Yet another object of the invention is to provide the use of gene sequences encoding
the proteins useful as candidate anti-infectives.

Summary of the Invention

The present invention relates to a computational method for the identification of
candidate proteins useful as anti-infectives. The invention particularly describes a novel strategy
to identify outlier proteins in different genomes of pathogens. These anti-infectives are vaccine
candidates, diagnostics or drug targets.

Detailed Description of the Invention 

Accordingly, the present invention provides a novel computational method for the
identification of candidate proteins in pathogens useful as anti-infectives. Computational
algorithms based on general principles are used to carry out data mining to decipher useful
patterns for sequence characterization and classification. The computational method involves the
calculation of several sequence attributes, whose subsequent analysis leads to the identification
of outlier proteins in different pathogens. Thus, the present invention is useful for the
identification of outlier proteins in pathogenic organisms. These outlier proteins are
either virulence proteins or antigens, or may be used as drug targets. The outlier proteins from
different genomes constitute a set of candidates for functional characterization through targeted
gene disruption, microarrays and proteomics. Further, these proteins constitute a set of
candidates for further testing in the development of anti-infectives such as vaccine candidates,
diagnostics or drug targets. The genes encoding the candidate proteins are also provided.

The invention provides a set of candidate proteins and genes for further evaluation as
diagnostic or vaccine candidates, or for testing in diagnostics or drug susceptibility for
human pathogens. The method of the invention is based on the analysis of protein sequence
attributes instead of sequence patterns linked to biochemical functions. The present method is
therefore independent of the discrepancies inherent in a pattern-based approach. The invention
provides a computational method which involves multivariate analysis using Principal Component
Analysis (PCA). The proteins termed 'outliers' were found to be excluded from the protein
clusters in the genomes of various pathogens. Several unique sequences were located by homology
analysis of these 'outlier' protein sequences against those in the SWISS-PROT and PIR databases.
Some outlier sequences turned out to be identical or homologous to virulence proteins implicated
in antigenic and drug-susceptibility responses. By this approach, proteins could be identified
(short-listed) for further testing in the development of anti-infectives against pathogenic
organisms.

Computational algorithms based on general principles are needed to carry out data
mining to decipher useful patterns for sequence characterization and classification.

The invention has utility for providing new leads for the development of anti-infectives of
diagnostic, preventive and curative potential.

The present invention relates to a computational method for the identification of candidate
proteins useful as anti-infectives.

Accordingly, the present invention provides a novel method for identifying the candidate
proteins useful as anti-infectives, said method comprising:

i) calculating computationally the different sequence-based attributes from all the protein
sequences of the selected pathogenic organisms,

ii) clustering computationally all the proteins of a genome based on these sequence-based
attributes using Principal Component Analysis,

iii) identifying computationally the outlier protein sequences which are excluded from
the main cluster,

iv) matching the outlier protein sequences with the protein sequences in various
databases,

v) selecting the unique outlier protein sequences not homologous to any of the protein
sequences searched above, and

vi) validating computationally the protein sequences as anti-infectives by comparing them with
the known protein sequences that are biochemically characterized in the pathogen genome.

In an embodiment of the present invention, the protein sequence data is taken from any
organism, specifically but not limited to organisms such as B.burgdorferi, C.jejuni,
C.pneumoniae, C.trachomatis, H.influenzae, H.pylori, L.major, M.genitalium, M.pneumoniae,
M.tuberculosis, N.meningitidis, P.aeruginosa, P.falciparum, R.prowazekii, T.pallidum and
V.cholerae.

In another embodiment of the present invention, the different sequence-based attributes used
for identification of candidate anti-infective proteins are selected from the group comprising
fixed protein attributes and variable protein attributes.

In still another embodiment of the present invention, the fixed protein attributes are
selected from the group comprising percentage of charged amino acids, percentage
hydrophobicity, distance of the protein sequence from a fixed reference frame, measure of
dipeptide complexity of the protein, and measure of hydrophobic distance from a fixed reference
frame.

In yet another embodiment of the present invention, the variable attribute is the distance
of the protein sequence from a variable reference frame.

In one more embodiment of the present invention, the cluster analysis is carried out by the
Principal Component Analysis technique using correlation coefficients between the attributes.

In one another embodiment of the present invention, the steps i to iv and vi are performed
computationally.

In an embodiment of the present invention, the clustering of the proteins is based upon 
analysis of sequence attributes instead of sequence pattern linked to biochemical functions. 

In another embodiment of the present invention, the unique outlier protein sequences
non-homologous to the known anti-infective sequences are identified specifically in, but not
limited to, the following pathogens: B.burgdorferi, C.jejuni, C.pneumoniae, C.trachomatis,
H.influenzae, H.pylori, L.major, M.genitalium, M.pneumoniae, M.tuberculosis, N.meningitidis,
P.aeruginosa, P.falciparum, R.prowazekii, T.pallidum and V.cholerae.

In still another embodiment of the present invention, the unique outlier sequences
obtained by the method of the invention that can serve as potential anti-infective candidates
are listed in Table 1 and the List.

In yet another embodiment of the present invention, the unique outlier hypothetical
protein sequences from pathogenic genomes that can serve as anti-infective candidates are listed
in Table 2.

In one more embodiment of the present invention, the genes encoding the unique proteins
useful as anti-infectives are provided.

In one another embodiment of the present invention, the computer system comprises a
central processing unit (CPU) executing the DISTANCE program and the clustering of the protein
sequences based on different attributes by Principal Component Analysis, all stored in a memory
device accessed by the CPU; a display on which the CPU displays the screens of
the above-mentioned programs in response to user inputs; and a user interface device.

In an embodiment of the present invention, the unique outlier hypothetical protein
sequences from pathogenic genomes can be used for diagnostic purposes.

In another embodiment of the present invention, the unique outlier hypothetical protein
sequences from pathogenic genomes can be used as vaccine candidates.

In still another embodiment of the present invention, the unique outlier hypothetical
protein sequences from pathogenic genomes can be used for therapeutic purposes.

Unique outlier protein sequences non-homologous to the known anti-infective sequences are
identified specifically in, but not limited to, the following pathogens: B.burgdorferi, C.jejuni,
C.pneumoniae, C.trachomatis, H.influenzae, H.pylori, L.major, M.genitalium, M.pneumoniae,
M.tuberculosis, N.meningitidis, P.aeruginosa, P.falciparum, R.prowazekii, T.pallidum and
V.cholerae.

Unique outlier protein sequences obtained by the method of the invention that can serve as
potential anti-infective candidates and have known properties are listed in Table 1 and the List.

Unique outlier hypothetical protein sequences from pathogenic genomes that can serve as
anti-infective candidates are listed in Table 2. These protein sequences have hypothetical
functions.

The List contains all the protein sequences that were marked as outliers by the clustering
method. These sequences were obtained from the NCBI database.

Other and further aspects, features and advantages of the present invention will be 
apparent from the following description of the presently preferred embodiments of the invention 
given for the purpose of disclosure.

Description of tables and sequence lists 

The List contains all the protein sequences that were marked as outliers by the clustering
method. These sequences were obtained from the NCBI database:
www.ncbi.nlm.nih.gov/genomes/bacteria.

Table 1 gives the list of outlier proteins with known functions.

Table 2 gives the list of outlier proteins with hypothetical functions.

Brief description of computer program:

The software program was written in PERL (Practical Extraction and Report Language)
and operated on a Silicon Graphics Origin 200 using the IRIX 6.5 operating system. The computer
program gives numerical data for the different attributes column-wise for each protein in one
record along with its GI number. The values in each column represent the values of the different
variates in the multivariate analysis. Using the rationale described above, we have developed the
data mining software, and a software copyright has been filed.

Statistical Analysis.

All statistical procedures were carried out using the SAS package (SAS Institute
Inc., USA). Principal Component Analysis using correlation coefficients between the variates was
carried out using this package.

Sequence analysis

Homology analysis was carried out using the Wisconsin Package Version 10.0, Genetics
Computer Group (GCG), Madison, Wisconsin.

Details of the Invention

The whole genome sequences of microbial pathogens present new opportunities for
clinical applications such as diagnostics and vaccines (Weinstock et al. 2000). The present
invention provides new leads for the development of candidate genes and their encoded proteins,
in terms of their functional relevance to drug responses, for use in predictive, preventive or
curative approaches.

The protein sequences of several pathogens were obtained computationally from the
existing databases (NCBI, genbank/genomes/bacteria). Different sequence attributes such as
hydrophobicity, charge, and measures of compositional distance and dipeptide complexity were
computed by a specially developed computer program, 'DISTANCE'. The attribute
profile was obtained for all the proteins of each pathogenic genome. These sequence-based
attributes were then used to carry out cluster analysis by the Principal Component Analysis
technique using correlation coefficients between the attributes. The proteins falling outside the
protein cluster in each genome were identified and termed outlier proteins. These outlier
proteins were compared by BLAST with the sequences of known protein anti-infectives to
identify potential candidates for anti-infective lead molecules, which can be envisaged to be
useful for predictive, preventive and curative purposes against pathogenic infections.

Accordingly, the invention provides a computer-based method for identifying the
candidate proteins useful as anti-infectives, which comprises:

1. calculating computationally the different sequence-based attributes from all the protein
sequences of the selected pathogenic organisms.

2. clustering computationally all the proteins of a genome based on these sequence-based
attributes using Principal Component Analysis.

3. identifying computationally the outlier protein sequences which are excluded from the
main cluster.

4. matching the outlier protein sequences with the protein sequences in various databases.

5. selecting the unique outlier protein sequences not homologous to any of the protein
sequences searched above.

6. validating computationally the protein sequences as anti-infectives by comparing them with
the known protein sequences that are biochemically characterized in the pathogen genome.

In an embodiment of the invention, the protein sequence data may be taken from any
organism, specifically but not limited to organisms such as B.burgdorferi, C.jejuni,
C.pneumoniae, C.trachomatis, H.influenzae, H.pylori, L.major, M.genitalium, M.pneumoniae,
M.tuberculosis, N.meningitidis, P.aeruginosa, P.falciparum, R.prowazekii, T.pallidum and
V.cholerae.

In an embodiment, the non-homologous outlier protein sequences may be compared with
known anti-infective sequences in the selected pathogens. Several unique outlier
sequences were identified as similar to sequences known to play a role as anti-infectives. These
unique sequences obtained by the method of the invention can serve as potential anti-infective
candidates.

In another embodiment of the present invention, the different sequence-based attributes used
for identification of candidate anti-infective proteins comprise charge, hydrophobicity, distance
from fixed and variable points of reference, hydrophobic distance and dipeptide complexity.

In another embodiment, the attributes may be of fixed type or variable type.

In another embodiment of the invention, the computer system comprises a central
processing unit (CPU) executing the DISTANCE program and the clustering of the protein sequences
based on different attributes by Principal Component Analysis, all stored in a memory device
accessed by the CPU; a display on which the CPU displays the screens of the
above-mentioned programs in response to user inputs; and a user interface device.

The particulars of the organisms such as their name, strain, accession number in the NCBI
database and other details are given below:

Genomes               Accession No.   No. of bp(s)     Date of completion
B.burgdorferi         NC_001318       910724 bp        Dec 17, 1997
C.jejuni              NC_002163       1641481 bp       Feb 10, 2000
C.pneumoniae CWL029   NC_000922       1230230 bp       Dec 1, 1998
C.trachomatis         NC_000117       1042519 bp       May 20, 1998
H.influenzae          NC_000907       1830138 bp       Jul 25, 1995
H.pylori              NC_000915       1667867 bp       Aug 6, 1997
L.major                               chromosome 1
M.genitalium          NC_000908       580074 bp        Jan 8, 2001
M.pneumoniae          NC_000912       816394 bp        Jan 1, 1900
M.tuberculosis        NC_000962       4411529 bp       Jun 11, 1998
N.meningitidis MC58   NC_002183       2272351 bp       Feb 25, 2000
P.aeruginosa          NC_002516       6264403 bp       May 16, 2000
P.falciparum                          chromosome 2,3
R.prowazekii          NC_000963       1111523 bp       Nov 12, 1998
T.pallidum            NC_000919       1138011 bp       Mar 6, 1998
V.cholerae            NC_002505       2961149 bp       Jun 14, 2000
                      NC_002506       1072315 bp       Jun 14, 2000

Genomes               Total number of proteins
B.burgdorferi         850
C.jejuni              1634
C.pneumoniae          1052
C.trachomatis         894
H.influenzae          1709
H.pylori              1553
L.major               683
M.genitalium          467
M.pneumoniae          677
M.tuberculosis        3918
N.meningitidis        2025
P.aeruginosa          5565
P.falciparum          422
R.prowazekii          834
T.pallidum            1031
V.cholerae            3828


Another embodiment of the invention is the use of the genes encoding the proteins
identified by the methods of the invention.

Brief Description of the Accompanying Drawings

In the drawing accompanying the specification,

Figure 1 represents one of the bivariate relationships for M. tuberculosis.

The invention is further explained with the help of the following examples, which are
given by way of illustration and should not be construed to limit the scope of the present
invention in any manner.

EXAMPLES 

Example 1: 
DISTANCE: 

The purpose of the program is to computationally calculate various sequence-based 
attributes of the protein sequences. 

The program works as follows: 

FASTA format files downloaded from http://www.ncbi.nlm.nih.gov are saved under the name
<organism_name>.faa and passed as input to the PERL program, which computes the different
attributes of the protein sequences.

Input/Output format: 

Downloaded Files and their format:

<organism_name>.faa: file which stores the annotation and the protein sequence.

<organism_name> refers to one of the following abbreviations:

Bbur (Borrelia burgdorferi), Bsub (Bacillus subtilis), Cjej (Campylobacter jejuni),
Cpneu (Chlamydia pneumoniae), Ctra (Chlamydia trachomatis), Hi (Haemophilus
influenzae), Hpyl (Helicobacter pylori), Lp (Leishmania major), Mg (Mycoplasma
genitalium), Mp (Mycoplasma pneumoniae), Mtub (Mycobacterium tuberculosis),
Nmen (Neisseria meningitidis), Paer (Pseudomonas aeruginosa), Pfal (Plasmodium
falciparum), Rp (Rickettsia prowazekii), Tpal (Treponema pallidum), Vcho (Vibrio
cholerae)

Format: FASTA

">gi|" <annotation>
the entire protein sequence

For example,

>gi|2314605|gb|AAD08472| histidine and glutamine-rich protein
MAHHEQQQQQQANSQHHHHHHAHHHHYYGGEHHHHNAQQHAEQQAEQQAQ
QQQQQQAHQQQQQKAQQQNQQY

>gi|3261822|gnl|PID|e328405 PE_PGRS
MIGDGANGGPGQPGGPGGLLYGNGGHGGAGAAGQDRGAGNSAGLIGNGGAGG
AGGNGGIGGAGAPGGLGGDGGKGGFADEFTGGFAQGGRGGFGGNGNTGASGG
MGGAGGAGGAGGAGGLLIGDGGAGGAGGIGGAGGVGGGGGAGGTGGGGVAS
AFGGGNAFGGRGGDGGDGGDGGTGGAGGARGAGGAGGAGGWLSGHSGAHG
AMGSGGEGGAGGGGGARGEAGAGGGTSTGTNPGKAGAPGTQGDSGDPGPPG

>gi|
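By way of illustration only, the reading of such a FASTA input file may be sketched as follows. The sketch is written in Python rather than in the PERL of the DISTANCE program, and the function name `read_fasta` is a hypothetical name introduced here; it is not taken from the original software.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (annotation, sequence) pairs from FASTA-formatted lines.

    Hypothetical helper: a record starts at a line beginning with '>' and
    its sequence is the concatenation of the lines that follow it.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]  # drop the leading '>'
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Usage with the first example record shown above.
example = """>gi|2314605|gb|AAD08472| histidine and glutamine-rich protein
MAHHEQQQQQQANSQHHHHHHAHHHHYYGGEHHHHNAQQHAEQQAEQQAQ
QQQQQQAHQQQQQKAQQQNQQY
"""
records = list(read_fasta(example.splitlines()))
```

Each record keeps the full annotation line, so that the GI number remains available for the output record.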

The output file: <organism_name>.mdis

Format, for example the format of mtub.mdis:

Gene name                      Length (L)  % Hydrophobicity  % charge  Dfixed  Dvar,high  Dipeptide complexity  Dphobic
>gi|2808711|gnl|PID|e1245984   507         49.9              25.44     63.06   53.38      90                    53.18
>gi|3261513|gnl|PID|e1299736   402         60.95             21.39     68.64   40.88      81                    60.3
>gi|1552556|gnl|PID|e266921    385         58.18             27.27     71.16   43.13      79                    59.25
>gi|1552557|gnl|PID|e266922    187         56.15             25.67     34.79   23.17      22                    29.1
>gi|1552558|gnl|PID|e266923    714         51.12             27.87     87.22   80.66      154                   77.04
>gi|1552559|gnl|PID|e266924    838         53.46             27.33     116.02  88.15      196                   97.71
>gi|1552560|gnl|PID|e266925    304         61.84             17.11     54.21   34.79      49                    47.55
>gi|

Example 2: 

Fixed Protein attributes: 

We developed a framework for statistical analysis using the following attributes of
proteins. The attributes used here are the hydrophobicity, charge, and different types of
compositional characteristics of a protein. Each attribute was quantified using a measure, and
each measure uses a reference frame for computation defined later in this section.

The attributes were treated as variates in the statistical analysis. The variates were
classified into two categories, namely, 'fixed' and 'variable'. In the case of 'fixed' variates, the
reference frame for analysis of different organisms (genomes) is fixed. Thus the reference frame
in these cases is not organism specific. For example, a particular scale of hydrophobicity is fixed
for the analysis of protein sequences across all organisms. In the case of 'variable' variates, the
reference frame for analysis of different organisms (genomes) varies from one to another. In
these cases, the reference frame is organism specific.

In this work, we have included both variates whose reference frames are not organism
specific and variates whose reference frames are organism specific, because our objective was to
analyze the different characteristics of the proteins in one module to enable us to draw
inferences with significance and practical utility. Thus proteins falling as outliers based on
all these variates have characteristics that are very different in general and also different
from those of the remaining members of the genome.

L is the length of the protein in number of amino acids.

The group of charged amino acids, the hydrophobicity scale used, the expected number of
occurrences of different amino acids, the expected number of different dipeptides in a protein,
and the expected number of hydrophobic amino acids (based on a particular hydrophobicity scale)
each constitute a reference frame for the different measures used in this work. These measures
are described below.

Fixed variates: 

Variate 1: is the percent of charged amino acids in a given protein. The charged amino acids
were Aspartic acid (D), Glutamic acid (E), Lysine (K) and Arginine (R).

% of Charge is given by

\[ \%\ \text{Charge} = \frac{\text{Number of charged amino acids}}{L} \times 100 \tag{1} \]

Variate 2: is the percent hydrophobicity of the protein. We have used several hydrophobicity
scales, namely the Fauchere & Pliska scale (Fauchere and Pliska, 1983), the Hopp & Woods scale
(1981), the Kyte & Doolittle scale (1982) and the Rose scale (Rose et al. 1985), to classify the
amino acids into hydrophobic and hydrophilic groups.

Percent Hydrophobicity is given by

\[ \%\ \text{Hydrophobicity} = \frac{\text{Number of hydrophobic amino acids}}{L} \times 100 \tag{2} \]
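For illustration, equations (1) and (2) may be computed as in the following minimal Python sketch. The set of charged residues (D, E, K, R) is taken from the description of Variate 1. The set `HYDROPHOBIC_EXAMPLE` is only a placeholder assumed for this sketch; in the method, the hydrophobic group depends on which of the four hydrophobicity scales is used, and the function names are likewise hypothetical.

```python
from typing import AbstractSet

# Charged amino acids as stated for Variate 1.
CHARGED = frozenset("DEKR")

# ASSUMPTION: placeholder hydrophobic group for illustration only. The real
# group is derived from the chosen scale (Fauchere & Pliska, Hopp & Woods,
# Kyte & Doolittle or Rose) and is not reproduced here.
HYDROPHOBIC_EXAMPLE = frozenset("AVLIMFWC")


def percent_in_group(sequence: str, group: AbstractSet[str]) -> float:
    """Return 100 * (number of residues belonging to `group`) / L."""
    length = len(sequence)
    if length == 0:
        raise ValueError("empty protein sequence")
    return 100.0 * sum(1 for residue in sequence if residue in group) / length


def percent_charge(sequence: str) -> float:
    """Equation (1): percentage of charged amino acids."""
    return percent_in_group(sequence, CHARGED)


def percent_hydrophobicity(sequence: str, hydrophobic: AbstractSet[str]) -> float:
    """Equation (2): percentage of hydrophobic amino acids for a given group."""
    return percent_in_group(sequence, hydrophobic)
```

Passing the hydrophobic group as an argument keeps the same function usable with each of the four scales taken one at a time.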

Variate 3: is a measure of distance of a protein sequence from a fixed reference frame. The
distance is measured according to the formula:

Dfixed = [formula not reproduced in the source text] (3)

Ox is the observed number of the xth amino acid in the protein and Ex is the expected number of
the xth amino acid in the same protein. In this case Ex is L/20, considering all amino acids to be
uniformly distributed in the fixed reference frame. Dfixed/L is a normalized measure of
distance for the protein.

Variate 4: is a measure of the dipeptide complexity of a protein. The reference frame here is the
maximum number of dipeptides possible in the protein for its length. The measure is given by

(i) for proteins of L < 800 amino acids

\[ \frac{\text{No. of different dipeptides observed in the protein}}{L/2} \tag{4} \]

and

(ii) for proteins of L > 800 amino acids

\[ \frac{\text{No. of different dipeptides observed in the protein}}{400} \tag{5} \]
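Equations (4) and (5) may be illustrated by the following Python sketch. Two points are assumptions made here and not stated in the text: dipeptides are read as consecutive non-overlapping pairs (which is consistent with L/2 being the maximum number possible), and a protein of exactly 800 amino acids is handled by equation (5). The function name is hypothetical.

```python
def dipeptide_complexity(sequence: str) -> float:
    """Dipeptide complexity following equations (4) and (5).

    ASSUMPTIONS (not given in the source text):
      * dipeptides are taken as non-overlapping pairs starting at position 0;
      * L == 800 is treated like L > 800.
    """
    length = len(sequence)
    if length < 2:
        raise ValueError("sequence must contain at least one dipeptide")
    # Collect the distinct non-overlapping dipeptides; a trailing single
    # residue in an odd-length sequence is ignored.
    distinct = {sequence[i:i + 2] for i in range(0, length - 1, 2)}
    if length < 800:
        return len(distinct) / (length / 2)  # equation (4)
    return len(distinct) / 400               # equation (5): 20 x 20 dipeptides


# Usage: two identical dipeptides out of a possible two.
value = dipeptide_complexity("ACAC")
```

The denominator 400 in equation (5) corresponds to the 20 x 20 possible dipeptides, which becomes the limiting maximum once L/2 exceeds 400.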



Variate 5: is a measure of hydrophobic distance of a protein in a genome from a fixed reference
frame.

Dphobic = [formula not reproduced in the source text]

Ox is the observed number of the xth hydrophobic amino acid in the protein and Ex is the expected
number of the xth hydrophobic amino acid in the same protein. In this case,

\[ E_x = \frac{\text{total no. of hydrophobic amino acids in the protein}}{z} \]

The computation of Ex assumes a uniform distribution of the different hydrophobic
amino acid types; z is the number of types of hydrophobic amino acids identified according to a
particular hydrophobicity scale. This is the fixed reference frame. z will vary according to the
hydrophobicity scale used. For example, in the Kyte & Doolittle scale z is 13, in the Hopp &
Woods scale z is 11, in the Fauchere & Pliska scale z is 11, and in the Rose scale z is 8.
Dphobic/L is a normalized measure of hydrophobic distance of a protein.

Example 3:

Variable Protein Attributes: 

Variate 6: is the distance of a protein sequence in a genome from a variable reference frame. In
this case the distance Dvar, high complexity has the same formula as that in Variate 3, but Ex is
calculated according to the formula:

\[ E_x = f_x \times L \tag{7} \]

where fx is the frequency of occurrence of the xth amino acid in the set of proteins that
are of 'high sequence complexity' within the same genome. For this purpose, we first run the
protein sequences encoded in the genome through our sequence complexity analysis computer
program (Ramachandran et al) and classify the proteins into 2 sets, namely, 'high complexity'
and 'low complexity', according to the fraction of low complexity sequences present in each
protein.

The frequency of each of the 20 amino acids in the high complexity set of proteins was
computed as the number of occurrences of the xth (x = 1 to 20) amino acid in the
protein set divided by the total number of amino acids in the same set. The frequencies of
occurrence of the different amino acids in this dataset are referred to as the variable reference
frame, because the frequencies of the different amino acids appearing in the high complexity set
of proteins are unequal to each other and vary from one genome to another. As in Variate 3,
Dvar, high complexity / L is a normalized measure of distance with respect to the variable
reference frame.
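The construction of the variable reference frame and of Ex in equation (7) may be sketched in Python as follows. The classification into 'high complexity' and 'low complexity' proteins is assumed to have been done beforehand by a separate program, so the sketch simply takes the high complexity set as input. The function names are hypothetical, and the distance formula itself is not reproduced.

```python
from collections import Counter
from typing import Dict, Iterable

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def reference_frequencies(high_complexity_proteins: Iterable[str]) -> Dict[str, float]:
    """Frequency f_x of each amino acid over the whole high complexity set.

    Residues outside the 20 standard amino acids are ignored, which is an
    assumption of this sketch.
    """
    counts: Counter = Counter()
    for sequence in high_complexity_proteins:
        counts.update(residue for residue in sequence if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no amino acids found in the high complexity set")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def expected_counts(frequencies: Dict[str, float], length: int) -> Dict[str, float]:
    """Equation (7): E_x = f_x * L for a protein of length L."""
    return {aa: f * length for aa, f in frequencies.items()}


# Usage with a toy 'high complexity' set of two sequences.
freqs = reference_frequencies(["AAC", "C"])
expected = expected_counts(freqs, 10)
```

Because the frequencies are computed per genome, the same protein length L yields different expected counts in different organisms, which is what makes this reference frame 'variable'.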

Example 4:

Clustering by Principal Component Analysis

A representation of one of the bivariate relationships for M. tuberculosis is shown in
Figure 1. The ellipse of the confidence limit at 80% is also shown. The relationship between the
variates Dfixed/L and % hydrophobicity shows that most of the proteins in different genomes
cluster into a large dense group. A few proteins tend to fall outside the cluster in different
organisms. Similar observations were made with all types of bivariate plots and with all
organisms (data not shown). These observations indicate the clustering nature of the proteins
from different organisms with respect to the protein attributes, and this feature could explain
the nature of uniformity observed in the distribution patterns discussed in the previous section.
The proteins that fall outside the clusters are termed 'outliers' in this work. The number of
outliers varies from one organism to another.
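As an illustration of how proteins lying outside a confidence ellipse in one bivariate plot might be flagged, the following Python sketch may be considered. It is not the SAS procedure used in this work. It assumes that the ellipse is the one implied by a bivariate normal distribution, so that a point lies outside the 80% ellipse when its squared Mahalanobis distance exceeds the chi-square quantile with 2 degrees of freedom, which has the closed form -2 ln(1 - 0.80). The function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def ellipse_outliers(points: Sequence[Tuple[float, float]],
                     confidence: float = 0.80) -> List[int]:
    """Return indices of points outside the bivariate confidence ellipse.

    ASSUMPTION: bivariate normal model, sample mean and sample covariance.
    """
    n = len(points)
    if n < 3:
        raise ValueError("need at least three points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("degenerate covariance matrix")
    # Chi-square quantile for 2 degrees of freedom.
    threshold = -2.0 * math.log(1.0 - confidence)
    outliers = []
    for index, (x, y) in enumerate(points):
        dx, dy = x - mean_x, y - mean_y
        # Squared Mahalanobis distance using the inverse 2x2 covariance.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        if d2 > threshold:
            outliers.append(index)
    return outliers


# Usage: eight clustered points and one distant point (index 8).
sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5),
          (0.2, 0.8), (0.8, 0.2), (0.5, 0.4), (10.0, 10.0)]
flagged = ellipse_outliers(sample)
```

In practice the two coordinates would be a pair of variates such as Dfixed/L and % hydrophobicity for every protein of a genome.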

In the present invention the most widely used hydrophobicity scales, charge composition,
and various distance measures based on amino acid frequencies have been used. When one
hydrophobicity scale is used instead of another, the list of outliers changes only very
slightly. Most of the outlier proteins are common to all the 4 scales. We have included in our list
all the outliers identified using the 4 different scales of hydrophobicity, each taken one at a
time.
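
The pooling of outliers over the 4 hydrophobicity scales may be illustrated by the following sketch. The scale labels and protein identifiers are hypothetical placeholders and are not taken from the results; the sketch only shows that the reported list is the union of the per-scale lists, while the proteins common to all scales form their intersection.

```python
def combine_outliers(outliers_by_scale):
    """Return (reported, common) sets of outlier identifiers.

    reported: outliers found with at least one scale (union).
    common:   outliers found with every scale (intersection).
    """
    per_scale = [set(ids) for ids in outliers_by_scale.values()]
    if not per_scale:
        return set(), set()
    reported = set().union(*per_scale)
    common = set.intersection(*per_scale)
    return reported, common


# Hypothetical identifiers and scale names, for illustration only.
outliers_by_scale = {
    "scale_1": ["p1", "p2", "p3"],
    "scale_2": ["p1", "p2", "p4"],
    "scale_3": ["p1", "p2", "p3"],
    "scale_4": ["p1", "p2"],
}
reported, common = combine_outliers(outliers_by_scale)
```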

A comprehensive study has been done to identify the outliers in different genomes by
principal components analysis at 0.8 of cumulative proportion of variance. The number of
outliers identified in different genomes is given (Table 1 & 2). It is evident that the number of
outliers does not have a clear relationship with the total number of proteins encoded in the
different genomes. This indicates that the properties of the outlier proteins do not follow a
common trend with respect to the number of proteins encoded in a genome (or the genome size).
The number of outliers in the case of P. falciparum and L. major is based on the partial
genomic sequences. A clearer picture will emerge after the whole genome is sequenced and the
protein coding regions are identified.
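
The criterion of 0.8 cumulative proportion of variance may be read as retaining the smallest number of principal components whose eigenvalues together account for at least 80% of the total variance. The sketch below illustrates that reading only; it is an assumption about how the criterion is applied, the function name is hypothetical, and the eigenvalues are invented for the example rather than taken from any genome.

```python
def components_for_variance(eigenvalues, target=0.8):
    """Smallest number of leading components reaching the target cumulative variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)


# Invented eigenvalues of a covariance matrix of five sequence attributes.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
n_components = components_for_variance(eigenvalues, target=0.8)
```

Outliers would then be sought among the proteins projected onto the retained components.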

Example 5: 

Prediction of anti-infective annotation in M. tuberculosis

Seven outlier sequences were identified in M. tuberculosis (Table 1 & 2). Among these,
three protein sequences correspond to the glycine rich protein PE_PGRS (Poly E rich proteins) of M.
tuberculosis. The amino acid sequences of these can be retrieved from the NCBI database
(http://www.ncbi.nlm.nih.gov). The PE_PGRS proteins have been implicated in virulence in
this pathogen (Ramakrishnan et al 2000). These unique outlier protein sequences can therefore
be predicted to be potential candidates for an anti-infective approach.

Example 6:

Prediction of anti-infective annotation in H. pylori

Eight outlier sequences were identified in H. pylori (Table 1 & 2). Bacteria lacking
one of these outliers, i.e. the histidine rich protein, cultured in vivo, are more susceptible than the wild
type to bismuth and Ni2+ (Mobley et al 1999). These unique outlier protein sequences can therefore
be predicted to be potential candidates for an anti-infective approach.

Example 7:

Prediction of anti-infective annotation in P. falciparum

Five outlier sequences were identified in P. falciparum (Table 1 & 2). The
circumsporozoite protein was evaluated as a vaccine candidate (Kester et al 2001). These unique
outlier protein sequences can therefore be predicted to be potential candidates for an anti-infective
approach.

The particulars of the organisms such as their name, strain, accession number in NCBI
database and other details are given below:

Genomes                Accession No.   No. of bp(s)    Date of completion

B.burgdorferi          NC_001318       910724 bp       Dec 17, 1997
C.jejuni               NC_002163       1641481 bp      Feb 10, 2000
C.pneumoniae CWL029    NC_000922       1230230 bp      Dec 1, 1998
C.trachomatis          NC_000117       1042519 bp      May 20, 1998
H.influenzae           NC_000907       1830138 bp      Jul 25, 1995
H.pylori               NC_000915       1667867 bp      Aug 6, 1997
L.major                chromosome 1 (no accession number, size or date given)
M.genitalium           NC_000908       580074 bp       Jan 8, 2001
M.pneumoniae           NC_000912       816394 bp       Jan 1, 1900
M.tuberculosis         NC_000962       4411529 bp      Jun 11, 1998
N.meningitidis MC58    NC_002183       2272351 bp      Feb 25, 2000
P.aeruginosa           NC_002516       6264403 bp      May 16, 2000
P.falciparum           chromosome 2,3 (no accession number, size or date given)
R.prowazekii           NC_000963       1111523 bp      Nov 12, 1998
T.pallidum             NC_000919       1138011 bp      Mar 6, 1998
V.cholerae             NC_002505       2961149 bp      Jun 14, 2000
                       NC_002506       1072315 bp      Jun 14, 2000


Genomes                Total number of proteins

B.burgdorferi          850
C.jejuni               1634
C.pneumoniae           1052
C.trachomatis          894
H.influenzae           1709
H.pylori               1553
L.major                683
M.genitalium           467
M.pneumoniae           677
M.tuberculosis         3918
N.meningitidis         [illegible]
P.aeruginosa           [illegible]
P.falciparum           422
R.prowazekii           834
T.pallidum             1031
V.cholerae             3828

Table 1: List of proteins with known functions

Organism    GI Number    Protein function                                       SEQ ID NO

Eubacteria

CJ          6967728      highly acidic protein                                  SEQ ID NO: 1
            6969129      small hydrophobic protein                              SEQ ID NO: 2
            6968493      putative coiled coil protein                           SEQ ID NO: 3
            6968611      highly acidic protein                                  SEQ ID NO: 4

CP          4376663      histone like protein 2                                 SEQ ID NO: 5

CT          3522902      hypothetical protein-possible frameshift with CT593    SEQ ID NO: 6
            3328438      histone like protein 2                                 SEQ ID NO: 7

HI          1573353      tolA                                                   SEQ ID NO: 8
            1574049      thiamin ABC transporter                                SEQ ID NO: 9
            1574645      heme exporter protein B                                SEQ ID NO: 10
            1573009      recombination protein                                  SEQ ID NO: 11

HP          2313421      poly E-rich protein                                    SEQ ID NO: 12
            2314604      histidine rich, metal binding polypeptide              SEQ ID NO: 13
            2314605      histidine and glutamine rich protein                   SEQ ID NO: 14

MG          1046012      cytadherence accessory protein                         SEQ ID NO: 15
            1046097      cytadherence accessory protein                         SEQ ID NO: 16

MP          1674069      adhesin related protein                                SEQ ID NO: 17

MTUB        3261822      PE_PGRS                                                SEQ ID NO: 18
            2894254      PE_PGRS                                                SEQ ID NO: 19
            2924449      PE_PGRS                                                SEQ ID NO: 20
            1781260      PPE                                                    SEQ ID NO: 21

PAER        9947600      KdpF                                                   SEQ ID NO: 22
            9951563      alginate regulatory protein AlgP                       SEQ ID NO: 23
            9951352      PhaF                                                   SEQ ID NO: 24

TP          3323280      dicarboxylate transporter                              SEQ ID NO: 25

VCHO        9654609      iron (III) ABC transporter, permease                   SEQ ID NO: 26
            9656364      tolA                                                   SEQ ID NO: 27

Eukaryotes

LM          1743289      hydrophilic surface protein 2                          SEQ ID NO: 28
            468328       hydrophilic surface protein                            SEQ ID NO: 29

PF          3845179      predicted integral membrane protein                    SEQ ID NO: 30
            4493889      circumsporozoite protein                               SEQ ID NO: 31



Table 2: List of hypothetical proteins

Organism    GI Number      SEQ ID NO          GI Number      SEQ ID NO

Eubacteria

BB          2688482        SEQ ID NO: 32      2688343        SEQ ID NO: 37
            2688046        SEQ ID NO: 33      2688447        SEQ ID NO: 38
            2688045        SEQ ID NO: 34      2688540        SEQ ID NO: 39
            2688103        SEQ ID NO: 35      2688768        SEQ ID NO: 40
            2688333        SEQ ID NO: 36      2688793        SEQ ID NO: 41

CJEJ        6967728        SEQ ID NO: 42      6968409        SEQ ID NO: 46
            6967819        SEQ ID NO: 43      6968423        SEQ ID NO: 47
            6968034        SEQ ID NO: 44      6968200        SEQ ID NO: 48
            6968265        SEQ ID NO: 45

CP          4377009        SEQ ID NO: 49      4377196        SEQ ID NO: 54
            4377120        SEQ ID NO: 50      4376483        SEQ ID NO: 55
            4377121        SEQ ID NO: 51      4376770        SEQ ID NO: 56
            4377216        SEQ ID NO: 52      4376779        SEQ ID NO: 57
            4376866        SEQ ID NO: 53      4376756        SEQ ID NO: 58

CT          3328515        SEQ ID NO: 59      3329121        SEQ ID NO: 61
            3329021        SEQ ID NO: 60

HI          1574537        SEQ ID NO: 62      1574799        SEQ ID NO: 65
            [illegible]    SEQ ID NO: 63      [illegible]    SEQ ID NO: 66
            1574625        SEQ ID NO: 64      1574607        SEQ ID NO: 67

HP          2313229        SEQ ID NO: 68      2313894        SEQ ID NO: 71
            2313552        SEQ ID NO: 69      2314686        SEQ ID NO: 72
            [illegible]    SEQ ID NO: 70

MG          1045905        SEQ ID NO: 73      1045811        SEQ ID NO: 74

MP          1674046        SEQ ID NO: 75      1674374        SEQ ID NO: 78
            1673719        SEQ ID NO: 76      1673775        SEQ ID NO: 79
            1673772        SEQ ID NO: 77

MTUB        2113965        SEQ ID NO: 80      2909499        SEQ ID NO: 82
            2117265        SEQ ID NO: 81

NM          7225315        SEQ ID NO: 83      7227030        SEQ ID NO: 86
            7226708        SEQ ID NO: 84      7227104        SEQ ID NO: 87
            7226768        SEQ ID NO: 85      7226645        SEQ ID NO: 88

PAER        9947556        SEQ ID NO: 89      9948900        SEQ ID NO: 91
            [illegible]    SEQ ID NO: 90      [illegible]    SEQ ID NO: 92

RP          [illegible]    SEQ ID NO: 93      [illegible]    SEQ ID NO: 94

TP          3322751        SEQ ID NO: 95      3322546        SEQ ID NO: 96

VCHO        9654409        SEQ ID NO: 97      9657724        SEQ ID NO: 102
            9654544        SEQ ID NO: 98      9657931        SEQ ID NO: 103
            [illegible]    SEQ ID NO: 99      9658035        SEQ ID NO: 104
            9656707        SEQ ID NO: 100     9658254        SEQ ID NO: 105
            9657609        SEQ ID NO: 101     9656580        SEQ ID NO: 106

Eukaryotic pathogens

PF          3845248        SEQ ID NO: 107     4493994        SEQ ID NO: 109
            3845292        SEQ ID NO: 108     4494004        SEQ ID NO: 110

LM          6996498        SEQ ID NO: 111     6562665        SEQ ID NO: 115
            6978417        SEQ ID NO: 112     6996509        SEQ ID NO: 116
            6899670        SEQ ID NO: 113     6433946        SEQ ID NO: 117
            6899664        SEQ ID NO: 114     5869911        SEQ ID NO: 118

Advantages

The method of the invention for identifying unique protein sequences useful as
anti-infectives is ab initio. It does not need a teaching data set. These anti-infectives are useful as
vaccine candidates, diagnostics, and drug responses. The method uses sequence attributes instead
of sequence patterns. The invention is generally applicable to all genomes and is easy to
implement in any setting. This approach yields reproducible results because the method does not
depend on variable biochemical characterization of proteins. However, functional information
from other systems is helpful in aiding testable predictions. The method of the invention can be
used for newly sequenced pathogens to provide a set of candidates for rapid evaluation for the
development of anti-infectives.